In our case study, we are employees of the fictitious company “217”, which produces sheet metal parts for vehicle bodies of German car manufacturers. Recently, there have been more and more customer complaints about premature rust spots on the company products. Intensive research revealed that this may be caused by the purchase of steel whose composition did not comply with the company’s specifications. An internal investigation revealed that this steel was processed in the company plants in the period from June 1, 2012 to July 31, 2014. Our goal is to analyze this situation with the production data and parts lists of our customers and to identify the customers and vehicles affected.Therefore we aim to identify critical production quantities, isolate them and inform customers and car manufacturers who have installed critical components in their vehicles from the available data.
We possess the information that body parts produced by our company are installed in “Type11”, “Type12”, “Type21” and “Type22” vehicles of “OEM1” and “OEM2” car brands.
Lastly, for reference, we also have the following translation which will be helpful for understanding the data:
if(!require("install.load")){
install.packages("install.load")
}
library('install.load')
install_load("readr", "readxl", "tidyr", "dplyr", "stringr", "tidyverse", "knitr", "rmarkdown", "ggplot2","lubridate","kableExtra")First, we will construct a function that will enable us to read messy flat files (.txt)
read_txt <- function(path, replacement){
path %>% # Take the file from the path
read_file() %>% # Import it as a string
str_replace_all(replacement) %>% # Apply a specific replacement for each dataset
str_replace_all(c("\n\"\\d+\"," = "\n", "\nNA," = "\n")) %>% # Apply general replacement
read_delim(show_col_types = FALSE) # Read it as a dataframe
}Recall the fact that we know our parts are installed in “Type11”, “Type12”, “Type21” and “Type22” vehicles of “OEM1” and “OEM2” car brands. Furthermore, we also know that the provided data are in the form relational databases which are combined through specific keys/IDs. Leading us to the following graph:
This information enabled us to identify the relevant datasets for our analysis, which are:
After having identified the relevant databases for our analysis, it is time to import these files:
geodata <- read_csv2("Data/Geodaten/Geodaten_Gemeinden_v1.2_2017-08-22_TrR.csv")registrations <- read_csv2("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv")car_type11 <- read_csv("Data/Fahrzeug/Fahrzeuge_OEM1_Typ11.csv")
car_type12 <- read_csv2("Data/Fahrzeug/Fahrzeuge_OEM1_Typ12.csv")
car_type21 <- read_csv("Data/Fahrzeug/Fahrzeuge_OEM2_Typ21.csv")
car_type22 <- read_csv2("Data/Fahrzeug/Fahrzeuge_OEM2_Typ22.csv")car_parts_type11 <-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ11.csv")
car_parts_type12 <-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ12.csv")
car_parts_type21 <-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ21.csv")
car_parts_type22 <-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ22.csv")comp_4 <- read_csv2("Data/Komponente/Komponente_k4.csv")
comp_5 <- read_csv("Data/Komponente/Komponente_k5.csv")
comp_6 <- read_csv2("Data/Komponente/Komponente_k6.csv")
comp_7 <- read_txt("Data/Komponente/Komponente_K7.txt",
replacement = c("\t" = ",", "\r" = "\n"))comp_parts_4 <- read_csv2("Data/Komponente/Bestandteile_Komponente_k4.csv")
comp_parts_5 <- read_csv2("Data/Komponente/Bestandteile_Komponente_k5.csv")
comp_parts_6 <- read_csv2("Data/Komponente/Bestandteile_Komponente_k6.csv")
comp_parts_7 <- read_csv2("Data/Komponente/Bestandteile_Komponente_k7.csv")P_30 <- read_csv("Data/Einzelteil/Einzelteil_T30.csv")
P_31 <- read_txt("Data/Einzelteil/Einzelteil_T31.txt",
replacement = c(" " = ",", "\b" = "\n"))
P_32 <- read_csv2("Data/Einzelteil/Einzelteil_T32.csv")
P_33 <- read_csv("Data/Einzelteil/Einzelteil_T33.csv")
P_34 <- read_txt("Data/Einzelteil/Einzelteil_T34.txt",
replacement = c(" \\| \\| " = ",", "\"\"" = "\"\n\""))
P_35 <- read_txt("Data/Einzelteil/Einzelteil_T35.txt",
replacement =c("\\\\" = ",", "\"\"" = "\"\n\"", "NA\"" = "NA\n\""))
P_37 <- read_csv("Data/Einzelteil/Einzelteil_T37.csv")In this section, we will proceed to tidy, clean and filter the datasets imported in the previous section. The goal is to have all datasets organized so that we can easily identify the relevant information and leave these data sets ready for the joining procedure.
In this part we will create three functions:
#Function to drop unnecessary columns
Drop_Cols <- function(df){
if('V1' %in% names(df)) {df <- df %>% dplyr::select(-V1)}
if('X1' %in% names(df)) {df <- df %>% dplyr::select(-X1)}
if('X' %in% names(df)) {df <- df %>% dplyr::select(-X)}
if('...1' %in% names(df)) {df <- df %>% dplyr::select(-...1)}
return(df)
}
#Function to transform date columns
Prod_date <- function(df){
starting_date <- as.Date("1970-01-01")
df %>%
mutate(Produktionsdatum = starting_date + Produktionsdatum_Origin_01011970) %>%
dplyr::select(-Produktionsdatum_Origin_01011970,-origin)
}
#Function to fix unstructured Data
Fix_unstruct_dat <- function(data){
#If the dataset has 14 columns we must convert it to only 7 columns:
if (ncol(data) == 14) {
df1 <- data[,1:7]
names(df1) <- sub("\\.x", "", names(df1))
df2 <- data[,8:14]
names(df2) <- sub("\\.y", "", names(df2))
#combine dataframes
df <- dplyr::bind_rows(df1, df2) %>% distinct() %>% drop_na(Werksnummer)
return(df)
#Same logic as in the second case when it has 21:
} else if(ncol(data) == 21) {
df1 <- data[,1:7]
names(df1) <- sub("\\.x", "", names(df1))
df2 <- data[,8:14]
names(df2) <- sub("\\.y", "", names(df2))
df3 <- data[,15:21]
# Combine Datasets
df <- dplyr::bind_rows(df1, df2, df3) %>% distinct() %>% drop_na(Werksnummer)
return(df)
} else {
print("Data cannot be processed with this function yet, prior changes are required!")
}
}
#Function to filter for company 217 and production dates indicated on the task
Filter_217_dates <- function(df){
df %>%
filter(Herstellernummer == 217)%>%
filter(Produktionsdatum >= "2012-06-01" & Produktionsdatum <= "2014-07-31")
}We intend to tidy the datasets such that they all repect the following conditions:
Thus, a dataset should meet three key points to be tidy:
#Geodata
geodata <- Drop_Cols(geodata)
# The first 846 obs of the dataset have only 4 digits but we know that in Germany Postal codes
#have 5 digits, thus we must transform the Postleitzahl variable
geodata$Postleitzahl <- as.character(geodata$Postleitzahl)
geodata$Postleitzahl[nchar(geodata$Postleitzahl) == 4] <-paste("0", geodata$Postleitzahl[nchar(geodata$Postleitzahl) == 4], sep = "")
glimpse(geodata)## Rows: 7,775
## Columns: 4
## $ Postleitzahl <chr> "01067", "01445", "01454", "01454", "01458", "01468", "01…
## $ Gemeinde <chr> "DRESDEN", "RADEBEUL", "RADEBERG", "WACHAU", "OTTENDORF O…
## $ Laengengrad <dbl> 13.73688, 13.65934, 13.92197, 13.90468, 13.83172, 13.6800…
## $ Breitengrad <dbl> 51.05170, 51.10598, 51.11640, 51.16012, 51.18762, 51.1594…
kable(head(geodata)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(geodata), width_max = "12em", width_min = "8em")| Postleitzahl | Gemeinde | Laengengrad | Breitengrad |
|---|---|---|---|
| 01067 | DRESDEN | 13.73688 | 51.05170 |
| 01445 | RADEBEUL | 13.65934 | 51.10598 |
| 01454 | RADEBERG | 13.92197 | 51.11640 |
| 01454 | WACHAU | 13.90468 | 51.16012 |
| 01458 | OTTENDORF OKRILLA | 13.83172 | 51.18762 |
| 01468 | MORITZBURG | 13.68009 | 51.15949 |
#Registrations
registrations <- Drop_Cols(registrations)
glimpse(registrations)## Rows: 3,204,104
## Columns: 3
## $ IDNummer <chr> "11-1-11-1", "11-1-11-2", "12-1-12-1", "12-1-12-2", "12-1-12…
## $ Gemeinden <chr> "DRESDEN", "DRESDEN", "LEIPZIG", "LEIPZIG", "DORTMUND", "DOR…
## $ Zulassung <date> 2009-01-01, 2009-01-01, 2009-01-01, 2009-01-01, 2009-01-01,…
kable(head(registrations)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(registrations), width_max = "12em", width_min = "8em")| IDNummer | Gemeinden | Zulassung |
|---|---|---|
| 11-1-11-1 | DRESDEN | 2009-01-01 |
| 11-1-11-2 | DRESDEN | 2009-01-01 |
| 12-1-12-1 | LEIPZIG | 2009-01-01 |
| 12-1-12-2 | LEIPZIG | 2009-01-01 |
| 12-1-12-3 | DORTMUND | 2009-01-01 |
| 12-1-12-4 | DORTMUND | 2009-01-01 |
#Cars
#Selection and Reordering of important columns:
order_car <-c("ID_Fahrzeug","Herstellernummer","Werksnummer")
car_type11 <- Drop_Cols(car_type11) %>% dplyr::select(order_car) %>% relocate(order_car)## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(order_car)` instead of `order_car` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
glimpse(car_type11)## Rows: 1,977,164
## Columns: 3
## $ ID_Fahrzeug <chr> "11-1-11-1", "11-1-11-2", "11-1-11-3", "11-1-11-4", "…
## $ Herstellernummer <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Werksnummer <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
kable(head(car_type11)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_type11), width_max = "12em", width_min = "8em")| ID_Fahrzeug | Herstellernummer | Werksnummer |
|---|---|---|
| 11-1-11-1 | 1 | 11 |
| 11-1-11-2 | 1 | 11 |
| 11-1-11-3 | 1 | 11 |
| 11-1-11-4 | 1 | 11 |
| 11-1-11-5 | 1 | 11 |
| 11-1-11-6 | 1 | 11 |
car_type12 <- Drop_Cols(car_type12) %>% dplyr::select(order_car) %>% relocate(order_car)
glimpse(car_type12)## Rows: 408,096
## Columns: 3
## $ ID_Fahrzeug <chr> "12-1-12-1", "12-1-12-2", "12-1-12-3", "12-1-12-4", "…
## $ Herstellernummer <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Werksnummer <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
kable(head(car_type12)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_type12), width_max = "12em", width_min = "8em")| ID_Fahrzeug | Herstellernummer | Werksnummer |
|---|---|---|
| 12-1-12-1 | 1 | 12 |
| 12-1-12-2 | 1 | 12 |
| 12-1-12-3 | 1 | 12 |
| 12-1-12-4 | 1 | 12 |
| 12-1-12-5 | 1 | 12 |
| 12-1-12-6 | 1 | 12 |
car_type21 <- Drop_Cols(car_type21) %>% Prod_date() %>% dplyr::select(order_car) %>% relocate(order_car)
glimpse(car_type21)## Rows: 512,354
## Columns: 3
## $ ID_Fahrzeug <chr> "21-2-21-1", "21-2-21-2", "21-2-21-3", "21-2-21-4", "…
## $ Herstellernummer <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ Werksnummer <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 2…
kable(head(car_type21)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_type21), width_max = "12em", width_min = "8em")| ID_Fahrzeug | Herstellernummer | Werksnummer |
|---|---|---|
| 21-2-21-1 | 2 | 21 |
| 21-2-21-2 | 2 | 21 |
| 21-2-21-3 | 2 | 21 |
| 21-2-21-4 | 2 | 21 |
| 21-2-21-5 | 2 | 21 |
| 21-2-21-6 | 2 | 21 |
car_type22 <- Drop_Cols(car_type22) %>% Prod_date() %>% dplyr::select(order_car) %>% relocate(order_car)
glimpse(car_type22)## Rows: 306,490
## Columns: 3
## $ ID_Fahrzeug <chr> "22-2-21-1", "22-2-21-2", "22-2-21-3", "22-2-21-4", "…
## $ Herstellernummer <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ Werksnummer <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 2…
kable(head(car_type22)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_type22), width_max = "12em", width_min = "8em")| ID_Fahrzeug | Herstellernummer | Werksnummer |
|---|---|---|
| 22-2-21-1 | 2 | 21 |
| 22-2-21-2 | 2 | 21 |
| 22-2-21-3 | 2 | 21 |
| 22-2-21-4 | 2 | 21 |
| 22-2-21-5 | 2 | 21 |
| 22-2-21-6 | 2 | 21 |
rm(order_car)# Since we only care about the body parts, we only select the columns ID_Karosserie and ID_Fahrzeug
car_parts_type11 <- Drop_Cols(car_parts_type11) %>% dplyr::select(ID_Karosserie,ID_Fahrzeug)
glimpse(car_parts_type11)## Rows: 1,977,164
## Columns: 2
## $ ID_Karosserie <chr> "K4-112-1121-3", "K4-112-1121-4", "K4-112-1121-7", "K4-1…
## $ ID_Fahrzeug <chr> "11-1-11-1", "11-1-11-2", "11-1-11-3", "11-1-11-4", "11-…
kable(head(car_parts_type11)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_parts_type11), width_max = "12em", width_min = "8em")| ID_Karosserie | ID_Fahrzeug |
|---|---|
| K4-112-1121-3 | 11-1-11-1 |
| K4-112-1121-4 | 11-1-11-2 |
| K4-112-1121-7 | 11-1-11-3 |
| K4-112-1121-9 | 11-1-11-4 |
| K4-112-1121-11 | 11-1-11-5 |
| K4-112-1121-25 | 11-1-11-6 |
car_parts_type12 <- Drop_Cols(car_parts_type12) %>% dplyr::select(ID_Karosserie,ID_Fahrzeug)
glimpse(car_parts_type12)## Rows: 408,096
## Columns: 2
## $ ID_Karosserie <chr> "K5-112-1122-1", "K5-112-1122-11", "K5-112-1122-2", "K5-…
## $ ID_Fahrzeug <chr> "12-1-12-1", "12-1-12-2", "12-1-12-3", "12-1-12-4", "12-…
kable(head(car_parts_type12)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_parts_type12), width_max = "12em", width_min = "8em")| ID_Karosserie | ID_Fahrzeug |
|---|---|
| K5-112-1122-1 | 12-1-12-1 |
| K5-112-1122-11 | 12-1-12-2 |
| K5-112-1122-2 | 12-1-12-3 |
| K5-112-1122-3 | 12-1-12-4 |
| K5-112-1122-4 | 12-1-12-5 |
| K5-112-1122-6 | 12-1-12-6 |
car_parts_type21 <- Drop_Cols(car_parts_type21) %>% dplyr::select(ID_Karosserie,ID_Fahrzeug)
glimpse(car_parts_type21)## Rows: 512,354
## Columns: 2
## $ ID_Karosserie <chr> "K6-113-1132-1", "K6-113-1132-2", "K6-113-1132-8", "K6-1…
## $ ID_Fahrzeug <chr> "21-2-21-1", "21-2-21-2", "21-2-21-4", "21-2-21-5", "21-…
kable(head(car_parts_type21)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_parts_type21), width_max = "12em", width_min = "8em")| ID_Karosserie | ID_Fahrzeug |
|---|---|
| K6-113-1132-1 | 21-2-21-1 |
| K6-113-1132-2 | 21-2-21-2 |
| K6-113-1132-8 | 21-2-21-4 |
| K6-113-1132-9 | 21-2-21-5 |
| K6-113-1132-11 | 21-2-21-6 |
| K6-113-1132-12 | 21-2-21-7 |
car_parts_type22 <- Drop_Cols(car_parts_type22) %>% dplyr::select(ID_Karosserie,ID_Fahrzeug)
glimpse(car_parts_type22)## Rows: 306,490
## Columns: 2
## $ ID_Karosserie <chr> "K7-114-1142-1", "K7-114-1142-2", "K7-114-1142-5", "K7-1…
## $ ID_Fahrzeug <chr> "22-2-21-1", "22-2-21-2", "22-2-21-3", "22-2-21-4", "22-…
kable(head(car_parts_type22)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_parts_type22), width_max = "12em", width_min = "8em")| ID_Karosserie | ID_Fahrzeug |
|---|---|
| K7-114-1142-1 | 22-2-21-1 |
| K7-114-1142-2 | 22-2-21-2 |
| K7-114-1142-5 | 22-2-21-3 |
| K7-114-1142-8 | 22-2-21-4 |
| K7-114-1142-11 | 22-2-21-5 |
| K7-114-1142-13 | 22-2-21-6 |
# From the components data, we believe that only the columns ID_Karosserie will be relevant for our analysis and therefore we only select those variables
comp_4 <- Drop_Cols(comp_4) %>% Fix_unstruct_dat()%>% dplyr::select(ID_Karosserie)
glimpse(comp_4)## Rows: 1,977,164
## Columns: 1
## $ ID_Karosserie <chr> "K4-112-1121-3", "K4-112-1121-4", "K4-112-1121-7", "K4-1…
kable(head(comp_4)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_4), width_max = "12em", width_min = "8em")| ID_Karosserie |
|---|
| K4-112-1121-3 |
| K4-112-1121-4 |
| K4-112-1121-7 |
| K4-112-1121-9 |
| K4-112-1121-11 |
| K4-112-1121-25 |
comp_5 <- Drop_Cols(comp_5) %>% Fix_unstruct_dat() %>% dplyr::select(ID_Karosserie)
glimpse(comp_5)## Rows: 408,096
## Columns: 1
## $ ID_Karosserie <chr> "K5-112-1122-1", "K5-112-1122-11", "K5-112-1122-2", "K5-…
kable(head(comp_5)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_5), width_max = "12em", width_min = "8em")| ID_Karosserie |
|---|
| K5-112-1122-1 |
| K5-112-1122-11 |
| K5-112-1122-2 |
| K5-112-1122-3 |
| K5-112-1122-4 |
| K5-112-1122-6 |
comp_6 <- Drop_Cols(comp_6) %>% Prod_date() %>% dplyr::select(ID_Karosserie)
glimpse(comp_6)## Rows: 512,354
## Columns: 1
## $ ID_Karosserie <chr> "K6-113-1132-1", "K6-113-1132-2", "K6-113-1132-3", "K6-1…
kable(head(comp_6)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_6), width_max = "12em", width_min = "8em")| ID_Karosserie |
|---|
| K6-113-1132-1 |
| K6-113-1132-2 |
| K6-113-1132-3 |
| K6-113-1132-8 |
| K6-113-1132-9 |
| K6-113-1132-11 |
comp_7 <- Drop_Cols(comp_7) %>% Prod_date() %>% dplyr::select(ID_Karosserie)
glimpse(comp_7)## Rows: 306,490
## Columns: 1
## $ ID_Karosserie <chr> "K7-114-1142-1", "K7-114-1142-2", "K7-114-1142-5", "K7-1…
kable(head(comp_7)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_7), width_max = "12em", width_min = "8em")| ID_Karosserie |
|---|
| K7-114-1142-1 |
| K7-114-1142-2 |
| K7-114-1142-5 |
| K7-114-1142-8 |
| K7-114-1142-11 |
| K7-114-1142-13 |
comp_parts_4 <- Drop_Cols(comp_parts_4)
glimpse(comp_parts_4)## Rows: 1,977,164
## Columns: 4
## $ ID_T30 <chr> "30-216-2161-242", "30-216-2161-87", "30-217-2171-214", "30-216…
## $ ID_T31 <chr> "31-217-2172-302", "31-216-2161-242", "31-217-2172-53", "31-217…
## $ ID_T32 <chr> "32-216-2162-21", "32-216-2162-48", "32-216-2162-62", "32-216-2…
## $ ID_K4 <chr> "K4-112-1121-1", "K4-112-1121-2", "K4-112-1121-3", "K4-112-1121…
kable(head(comp_parts_4)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_parts_4), width_max = "12em", width_min = "8em")| ID_T30 | ID_T31 | ID_T32 | ID_K4 |
|---|---|---|---|
| 30-216-2161-242 | 31-217-2172-302 | 32-216-2162-21 | K4-112-1121-1 |
| 30-216-2161-87 | 31-216-2161-242 | 32-216-2162-48 | K4-112-1121-2 |
| 30-217-2171-214 | 31-217-2172-53 | 32-216-2162-62 | K4-112-1121-3 |
| 30-216-2161-144 | 31-217-2172-139 | 32-216-2162-134 | K4-112-1121-4 |
| 30-216-2161-71 | 31-216-2161-260 | 32-216-2162-153 | K4-112-1121-5 |
| 30-216-2161-214 | 31-216-2161-328 | 32-216-2162-165 | K4-112-1121-6 |
comp_parts_5 <- Drop_Cols(comp_parts_5)
glimpse(comp_parts_5)## Rows: 408,096
## Columns: 4
## $ ID_T30 <chr> "30-217-2171-67", "30-216-2162-3", "30-216-2162-10", "30-216-21…
## $ ID_T31 <chr> "31-216-2161-15", "31-216-2161-67", "31-216-2161-125", "31-216-…
## $ ID_T33 <chr> "33-217-2172-32", "33-217-2172-82", "33-216-2162-8", "33-216-21…
## $ ID_K5 <chr> "K5-112-1122-1", "K5-112-1122-2", "K5-112-1122-3", "K5-112-1122…
kable(head(comp_parts_5)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_parts_5), width_max = "12em", width_min = "8em")| ID_T30 | ID_T31 | ID_T33 | ID_K5 |
|---|---|---|---|
| 30-217-2171-67 | 31-216-2161-15 | 33-217-2172-32 | K5-112-1122-1 |
| 30-216-2162-3 | 31-216-2161-67 | 33-217-2172-82 | K5-112-1122-2 |
| 30-216-2162-10 | 31-216-2161-125 | 33-216-2162-8 | K5-112-1122-3 |
| 30-216-2162-11 | 31-216-2161-316 | 33-216-2162-11 | K5-112-1122-4 |
| 30-216-2162-50 | 31-217-2172-99 | 33-216-2162-12 | K5-112-1122-5 |
| 30-216-2162-53 | 31-217-2172-183 | 33-216-2162-14 | K5-112-1122-6 |
# We do not select the column ID_T36 as our company does not produce part 36
comp_parts_6 <- Drop_Cols(comp_parts_6) %>% dplyr::select(-ID_T36)
glimpse(comp_parts_6)## Rows: 512,354
## Columns: 4
## $ ID_T34 <chr> "34-217-2173-51", "34-217-2173-45", "34-218-2181-24", "34-217-2…
## $ ID_T35 <chr> "35-217-2173-9", "35-217-2173-64", "35-217-2173-107", "35-218-2…
## $ ID_T37 <chr> "37-218-2183-22", "37-218-2183-52", "37-218-2182-11", "37-217-2…
## $ ID_K6 <chr> "K6-113-1132-1", "K6-113-1132-2", "K6-113-1132-3", "K6-113-1132…
kable(head(comp_parts_6)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_parts_6), width_max = "12em", width_min = "8em")| ID_T34 | ID_T35 | ID_T37 | ID_K6 |
|---|---|---|---|
| 34-217-2173-51 | 35-217-2173-9 | 37-218-2183-22 | K6-113-1132-1 |
| 34-217-2173-45 | 35-217-2173-64 | 37-218-2183-52 | K6-113-1132-2 |
| 34-218-2181-24 | 35-217-2173-107 | 37-218-2182-11 | K6-113-1132-3 |
| 34-217-2173-36 | 35-218-2181-200 | 37-217-2173-6 | K6-113-1132-4 |
| 34-218-2181-65 | 35-217-2173-2 | 37-217-2173-7 | K6-113-1132-5 |
| 34-218-2181-139 | 35-217-2173-152 | 37-217-2173-10 | K6-113-1132-6 |
# We only select the columns of the parts produced by our company
comp_parts_7 <- Drop_Cols(comp_parts_7) %>% dplyr::select(ID_T34, ID_T35, ID_K7)
glimpse(comp_parts_7)## Rows: 306,490
## Columns: 3
## $ ID_T34 <chr> "34-217-2173-2", "34-217-2173-24", "34-218-2181-42", "34-218-21…
## $ ID_T35 <chr> "35-218-2181-58", "35-217-2173-92", "35-218-2181-2", "35-218-21…
## $ ID_K7 <chr> "K7-114-1142-1", "K7-114-1142-2", "K7-114-1142-3", "K7-114-1142…
kable(head(comp_parts_7)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_parts_7), width_max = "12em", width_min = "8em")| ID_T34 | ID_T35 | ID_K7 |
|---|---|---|
| 34-217-2173-2 | 35-218-2181-58 | K7-114-1142-1 |
| 34-217-2173-24 | 35-217-2173-92 | K7-114-1142-2 |
| 34-218-2181-42 | 35-218-2181-2 | K7-114-1142-3 |
| 34-218-2182-13 | 35-218-2181-12 | K7-114-1142-4 |
| 34-218-2182-45 | 35-218-2181-15 | K7-114-1142-5 |
| 34-217-2173-15 | 35-218-2181-16 | K7-114-1142-6 |
# We decided to rename the ID columns of each part dataset, because this will facilitate binding the rows later.
order_parts <- c("ID_Part", "Produktionsdatum", "Fehlerhaft", "Fehlerhaft_Datum",
"Fehlerhaft_Fahrleistung")
P_30 <- Drop_Cols(P_30) %>% Fix_unstruct_dat() %>% Filter_217_dates() %>%
rename(ID_Part = ID_T30) %>% dplyr::select(order_parts) %>% relocate(order_parts)## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(order_parts)` instead of `order_parts` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
glimpse(P_30)## Rows: 193,798
## Columns: 5
## $ ID_Part <chr> "30-217-2171-317715", "30-217-2171-317698", "3…
## $ Produktionsdatum <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0…
## $ Fehlerhaft_Datum <date> NA, NA, NA, NA, NA, NA, NA, 2012-12-17, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1181…
kable(head(P_30)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_30), width_max = "12em", width_min = "8em")| ID_Part | Produktionsdatum | Fehlerhaft | Fehlerhaft_Datum | Fehlerhaft_Fahrleistung |
|---|---|---|---|---|
| 30-217-2171-317715 | 2012-06-01 | 0 | NA | 0 |
| 30-217-2171-317698 | 2012-06-01 | 0 | NA | 0 |
| 30-217-2171-317776 | 2012-06-01 | 0 | NA | 0 |
| 30-217-2171-317706 | 2012-06-01 | 0 | NA | 0 |
| 30-217-2171-318036 | 2012-06-02 | 0 | NA | 0 |
| 30-217-2171-317926 | 2012-06-01 | 0 | NA | 0 |
P_31 <- Drop_Cols(P_31) %>% Prod_date() %>% Filter_217_dates() %>%
rename(ID_Part = ID_T31) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_31)## Rows: 258,679
## Columns: 5
## $ ID_Part <chr> "31-217-2172-424554", "31-217-2172-952233", "3…
## $ Produktionsdatum <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Fehlerhaft_Datum <date> NA, 2012-09-20, NA, NA, NA, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.000, 1142.860, 0.000, 0.000, 0.000, 0.000, 0…
kable(head(P_31)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_31), width_max = "12em", width_min = "8em")| ID_Part | Produktionsdatum | Fehlerhaft | Fehlerhaft_Datum | Fehlerhaft_Fahrleistung |
|---|---|---|---|---|
| 31-217-2172-424554 | 2012-06-01 | 0 | NA | 0.00 |
| 31-217-2172-952233 | 2012-06-01 | 1 | 2012-09-20 | 1142.86 |
| 31-217-2172-424488 | 2012-06-01 | 0 | NA | 0.00 |
| 31-217-2172-424684 | 2012-06-01 | 0 | NA | 0.00 |
| 31-217-2172-424465 | 2012-06-01 | 0 | NA | 0.00 |
| 31-217-2172-424686 | 2012-06-01 | 0 | NA | 0.00 |
P_32 <- Drop_Cols(P_32) %>% Fix_unstruct_dat() %>% Filter_217_dates() %>%
rename(ID_Part = ID_T32) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_32)## Rows: 268,191
## Columns: 5
## $ ID_Part <chr> "32-217-2171-440142", "32-217-2171-440262", "3…
## $ Produktionsdatum <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Fehlerhaft_Datum <date> NA, NA, NA, 2014-05-05, NA, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 54927.68, 0.00, 0.00, 0.00, …
kable(head(P_32)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_32), width_max = "12em", width_min = "8em")| ID_Part | Produktionsdatum | Fehlerhaft | Fehlerhaft_Datum | Fehlerhaft_Fahrleistung |
|---|---|---|---|---|
| 32-217-2171-440142 | 2012-06-01 | 0 | NA | 0.00 |
| 32-217-2171-440262 | 2012-06-01 | 0 | NA | 0.00 |
| 32-217-2171-440377 | 2012-06-01 | 0 | NA | 0.00 |
| 32-217-2171-440470 | 2012-06-02 | 1 | 2014-05-05 | 54927.68 |
| 32-217-2171-440083 | 2012-06-01 | 0 | NA | 0.00 |
| 32-217-2171-440187 | 2012-06-01 | 0 | NA | 0.00 |
P_33 <- Drop_Cols(P_33) %>% Prod_date() %>% Filter_217_dates() %>%
rename(ID_Part = ID_T33) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_33)## Rows: 66,443
## Columns: 5
## $ ID_Part <chr> "33-217-2172-108136", "33-217-2172-108116", "3…
## $ Produktionsdatum <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Fehlerhaft_Datum <date> NA, NA, NA, NA, NA, NA, 2013-11-01, NA, 2013-…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 38781.60, …
kable(head(P_33)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_33), width_max = "12em", width_min = "8em")| ID_Part | Produktionsdatum | Fehlerhaft | Fehlerhaft_Datum | Fehlerhaft_Fahrleistung |
|---|---|---|---|---|
| 33-217-2172-108136 | 2012-06-01 | 0 | NA | 0 |
| 33-217-2172-108116 | 2012-06-01 | 0 | NA | 0 |
| 33-217-2172-108084 | 2012-06-01 | 0 | NA | 0 |
| 33-217-2172-108091 | 2012-06-01 | 0 | NA | 0 |
| 33-217-2172-108094 | 2012-06-01 | 0 | NA | 0 |
| 33-217-2172-108097 | 2012-06-01 | 0 | NA | 0 |
P_34 <- Drop_Cols(P_34) %>% Prod_date() %>% Filter_217_dates() %>%
rename(ID_Part = ID_T34) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_34)## Rows: 67,222
## Columns: 5
## $ ID_Part <chr> "34-217-2173-109424", "34-217-2173-109442", "3…
## $ Produktionsdatum <date> 2012-06-01, 2012-06-01, 2012-06-02, 2012-06-0…
## $ Fehlerhaft <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Fehlerhaft_Datum <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
kable(head(P_34)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_34), width_max = "12em", width_min = "8em")| ID_Part | Produktionsdatum | Fehlerhaft | Fehlerhaft_Datum | Fehlerhaft_Fahrleistung |
|---|---|---|---|---|
| 34-217-2173-109424 | 2012-06-01 | 0 | NA | 0 |
| 34-217-2173-109442 | 2012-06-01 | 0 | NA | 0 |
| 34-217-2173-109459 | 2012-06-02 | 0 | NA | 0 |
| 34-217-2173-109483 | 2012-06-02 | 0 | NA | 0 |
| 34-217-2173-109531 | 2012-06-02 | 0 | NA | 0 |
| 34-217-2173-109432 | 2012-06-01 | 0 | NA | 0 |
P_35$Fehlerhaft_Fahrleistung.y <-as.double(P_35$Fehlerhaft_Fahrleistung.y)
P_35 <- Drop_Cols(P_35) %>% Fix_unstruct_dat() %>% Filter_217_dates() %>%
rename(ID_Part = ID_T35) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_35)## Rows: 133,669
## Columns: 5
## $ ID_Part <chr> "35-217-2173-218806", "35-217-2173-218777", "3…
## $ Produktionsdatum <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ Fehlerhaft_Datum <date> NA, NA, NA, NA, 2013-11-20, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 0.00, 37673.98, 0.00, 0.00, …
kable(head(P_35)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_35), width_max = "12em", width_min = "8em")| ID_Part | Produktionsdatum | Fehlerhaft | Fehlerhaft_Datum | Fehlerhaft_Fahrleistung |
|---|---|---|---|---|
| 35-217-2173-218806 | 2012-06-01 | 0 | NA | 0.00 |
| 35-217-2173-218777 | 2012-06-01 | 0 | NA | 0.00 |
| 35-217-2173-218800 | 2012-06-01 | 0 | NA | 0.00 |
| 35-217-2173-218975 | 2012-06-02 | 0 | NA | 0.00 |
| 35-217-2173-218818 | 2012-06-01 | 1 | 2013-11-20 | 37673.98 |
| 35-217-2173-218872 | 2012-06-01 | 0 | NA | 0.00 |
P_37 <- Drop_Cols(P_37) %>% Prod_date() %>% Filter_217_dates() %>%
rename(ID_Part = ID_T37) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_37)## Rows: 14,202
## Columns: 5
## $ ID_Part <chr> "37-217-2173-22140", "37-217-2173-22141", "37-…
## $ Produktionsdatum <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Fehlerhaft_Datum <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
kable(head(P_37)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_37), width_max = "12em", width_min = "8em")| ID_Part | Produktionsdatum | Fehlerhaft | Fehlerhaft_Datum | Fehlerhaft_Fahrleistung |
|---|---|---|---|---|
| 37-217-2173-22140 | 2012-06-01 | 0 | NA | 0 |
| 37-217-2173-22141 | 2012-06-01 | 0 | NA | 0 |
| 37-217-2173-22136 | 2012-06-01 | 0 | NA | 0 |
| 37-217-2173-22146 | 2012-06-01 | 0 | NA | 0 |
| 37-217-2173-22150 | 2012-06-01 | 0 | NA | 0 |
| 37-217-2173-22164 | 2012-06-02 | 0 | NA | 0 |
rm(order_parts)Now that we have successfully tidied, cleaned, and filtered the datasets, it is time to create the final dataset through a step-by-step joining procedure.
We make use of a left join using the community variables as key to extract the geolocation data and join it with the registration data of each car.
registrations <- registrations %>%
left_join(geodata, by = c('Gemeinden' = 'Gemeinde'))%>%
relocate(IDNummer, Zulassung, Gemeinden, Postleitzahl, Laengengrad,Breitengrad)
rm(geodata)
kable(sample_n(registrations,6)) %>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(registrations), width_max = "12em", width_min = "8em")| IDNummer | Zulassung | Gemeinden | Postleitzahl | Laengengrad | Breitengrad |
|---|---|---|---|---|---|
| 21-2-21-241800 | 2012-12-24 | MICHENDORF | 14552 | 13.028498 | 52.30963 |
| 22-2-21-58061 | 2016-12-02 | KARLSDORF NEUTHARD | 76689 | 8.545484 | 49.13615 |
| 22-2-22-171008 | 2014-10-02 | LUSTADT | 67363 | 8.274533 | 49.24452 |
| 11-1-11-1034907 | 2016-02-02 | BAIENFURT | 88255 | 9.651432 | 47.82698 |
| 21-2-21-123285 | 2011-02-21 | BIELEFELD | 33602 | 8.532120 | 52.02058 |
| 21-2-21-494007 | 2016-10-21 | OBERHAUSEN | 46045 | 6.852291 | 51.47032 |
In this part, we first join the data of each car type with the data of the parts used to produce that car joining by the variable ID_Fahrzeug. We then created a single dataset containing the data of all different car types. Finally, we join this cars dataset with the registration dataset yielding the following result.
car_type11 <- car_type11 %>% left_join(car_parts_type11, "ID_Fahrzeug")
rm(car_parts_type11)
car_type12 <- car_type12 %>% left_join(car_parts_type12, "ID_Fahrzeug")
rm(car_parts_type12)
car_type21 <- car_type21 %>% left_join(car_parts_type21, "ID_Fahrzeug")
rm(car_parts_type21)
car_type22 <- car_type22 %>% left_join(car_parts_type22, "ID_Fahrzeug")
rm(car_parts_type22)
cars <- bind_rows(car_type11,car_type12,car_type21,car_type22)
rm(car_type11,car_type12,car_type21,car_type22)
cars <- cars %>% left_join(registrations, by = c('ID_Fahrzeug' = 'IDNummer')) %>% distinct()
rm(registrations)
kable(sample_n(cars,6)) %>%
kable_styling(bootstrap_options = "striped",full_width = F,font_size = 9, position = "left") %>%
column_spec(1:ncol(cars), width_max = "12em", width_min = "8em")| ID_Fahrzeug | Herstellernummer | Werksnummer | ID_Karosserie | Zulassung | Gemeinden | Postleitzahl | Laengengrad | Breitengrad |
|---|---|---|---|---|---|---|---|---|
| 21-2-21-135783 | 2 | 21 | K6-113-1132-135777 | 2011-04-27 | ILVESHEIM | 68549 | 8.567892 | 49.47232 |
| 11-1-11-544405 | 1 | 11 | K4-114-1141-115751 | 2012-10-19 | PIRMASENS | 66953 | 7.608220 | 49.19741 |
| 22-2-22-21313 | 2 | 22 | K7-114-1142-27017 | 2009-09-01 | ORENHOFEN | 54298 | 6.650700 | 49.89688 |
| 11-1-12-376491 | 1 | 12 | K4-114-1141-157815 | 2012-12-19 | WOLFHAGEN | 34466 | 9.169659 | 51.32602 |
| 11-1-12-323907 | 1 | 12 | K4-114-1141-25715 | 2012-06-01 | KLEIN MECKELSEN | 27419 | 9.451117 | 53.30395 |
| 11-1-11-234328 | 1 | 11 | K4-112-1121-389796 | 2010-09-13 | MUENSTER | 48143 | 7.626804 | 51.96191 |
Before joining the components data with the cars data, we need to join the components with the data on parts used to build these components. However, when we do this, we will encounter a data frame containing a column for each part used to build the component. Instead, we want a single column for all parts. That is when we make use of the pivot_longer() function, which “lengthens” data, increasing the number of rows and decreasing the number of columns. In essence, it does the following:
Therefore, we create a special pivot function for the components data named Pivot_comp_parts:
Pivot_comp_parts <- function(df){
df %>%
pivot_longer(cols = starts_with("ID_T"), # Takes all the columns starting with ID_T
names_to = "Part_Type", # Classify them into a column named Part_Type
names_prefix = "ID_T", # Remove the prefix ID_T from Part_Type
values_to = "ID_Part") # Put the values of the selected colunms into ID_Part
}After creating the Pivot_comp_parts function we can proceed to join the data of the components with the data on parts used to build these components and finally we can left join the cars dataset yuilding the following result:
comp_4 <- comp_4 %>% left_join(comp_parts_4, by = c('ID_Karosserie' = 'ID_K4')) %>%
Pivot_comp_parts()
rm(comp_parts_4)
comp_5 <- comp_5 %>% left_join(comp_parts_5, by = c('ID_Karosserie' = 'ID_K5')) %>%
Pivot_comp_parts()
rm(comp_parts_5)
comp_6 <- comp_6 %>% left_join(comp_parts_6, by = c('ID_Karosserie' = 'ID_K6')) %>%
Pivot_comp_parts()
rm(comp_parts_6)
comp_7 <- comp_7 %>% left_join(comp_parts_7, by = c('ID_Karosserie' = 'ID_K7')) %>%
Pivot_comp_parts()
rm(comp_parts_7)
components <- bind_rows(comp_4,comp_5,comp_6,comp_7) %>% distinct()
rm(comp_4,comp_5, comp_6,comp_7)
components <- components %>% left_join(cars, "ID_Karosserie") %>% distinct()
rm(cars)
kable(sample_n(components,6))%>%
kable_styling(bootstrap_options = "striped",full_width = F,font_size = 8, position = "left") %>%
column_spec(1:ncol(components), width_max = "12em", width_min = "8em")| ID_Karosserie | Part_Type | ID_Part | ID_Fahrzeug | Herstellernummer | Werksnummer | Zulassung | Gemeinden | Postleitzahl | Laengengrad | Breitengrad |
|---|---|---|---|---|---|---|---|---|---|---|
| K4-112-1121-613177 | 30 | 30-217-2171-220546 | 11-1-12-242946 | 1 | 12 | 2011-08-10 | ALSHEIM | 67577 | 8.341322 | 49.76793 |
| K6-113-1132-236468 | 34 | 34-217-2173-111886 | 21-2-21-236465 | 2 | 21 | 2012-11-15 | LUCKENWALDE | 14943 | 13.173009 | 52.08925 |
| K5-112-1122-215996 | 30 | 30-217-2171-376277 | 12-1-12-216009 | 1 | 12 | 2013-05-10 | WAHRENHOLZ | 29399 | 10.603296 | 52.61567 |
| K4-114-1141-36505 | 30 | 30-216-2162-198939 | 11-1-11-496156 | 1 | 11 | 2012-06-15 | BAD ESSEN | 49152 | 8.343951 | 52.31962 |
| K4-112-1121-744525 | 30 | 30-216-2161-267766 | 11-1-12-295161 | 1 | 12 | 2012-02-17 | MAINZ | 55116 | 8.271855 | 50.00196 |
| K4-114-1141-933955 | 31 | 31-216-2161-1245910 | 11-1-11-1034010 | 1 | 11 | 2016-01-27 | WAGHAEUSEL | 68753 | 8.514287 | 49.25073 |
Creating a parts dataset is relatively simple as we have previously tidied, cleaned, and filtered all the parts. By applying bind_rows() to all the parts data sets and then erasing duplicates with distinct(), we get the parts dataset containing all the relevant parts for our analysis. Finally, to get the final dataset, which will be used for our shiny app, we simply need to left join the componets data set to the parts dataset by the created column ID_Part
parts <- bind_rows(P_30,P_31,P_32,P_33,P_34,P_35,P_37) %>% distinct()
rm(P_30,P_31,P_32,P_33,P_34,P_35,P_37)
Final_dataset_group_20 <- parts %>% left_join(components, "ID_Part") %>% distinct()
rm(parts, components)Finally, we will change the column names to their equivalent English translation for a matter of simplicity,reorder the dataset colunms, preview the head of the created dataset and save the entire dataset in the project folder as a CSV file.
Names_Dataset <- c("ID_Part","Production_Date_Part","Defective_Part", "Defective_Part_Date",
"Mileage_Part", "ID_Comp", "Part_Type", "ID_Car",
"OEM","Factory_Number","Registration",
"Community","Postal_Code", "Longitude", "Latitude")
names(Final_dataset_group_20) <- Names_Dataset
rm(Names_Dataset)
Final_dataset_group_20 <- Final_dataset_group_20 %>% relocate("ID_Part","Part_Type","Production_Date_Part","Defective_Part",
"Defective_Part_Date", "Mileage_Part", "ID_Comp", "ID_Car",
"OEM","Factory_Number","Registration","Community","Postal_Code",
"Longitude", "Latitude")
kable(sample_n(Final_dataset_group_20,6)) %>%
kable_styling(bootstrap_options = "striped",full_width = F,font_size = 6, position = "left") %>%
column_spec(1:ncol(Final_dataset_group_20), width_max = "12em", width_min = "8em")| ID_Part | Part_Type | Production_Date_Part | Defective_Part | Defective_Part_Date | Mileage_Part | ID_Comp | ID_Car | OEM | Factory_Number | Registration | Community | Postal_Code | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30-217-2171-406214 | 30 | 2013-05-29 | 0 | NA | 0 | K4-114-1141-337247 | 11-1-12-448152 | 1 | 12 | 2013-09-10 | EURASBURG | 82547 | 11.403471 | 47.85593 |
| 33-217-2172-131537 | 33 | 2013-03-10 | 0 | NA | 0 | K5-112-1122-221981 | 12-1-12-221886 | 1 | 12 | 2013-06-19 | UETZE | 31311 | 10.202538 | 52.46395 |
| 35-217-2173-331080 | 35 | 2014-03-31 | 0 | NA | 0 | K7-113-1132-54809 | 22-2-22-164500 | 2 | 22 | 2014-07-16 | MAINZ | 55116 | 8.271855 | 50.00196 |
| 31-217-2172-623162 | 31 | 2014-01-31 | 0 | NA | 0 | K5-113-1131-22543 | 12-1-12-267524 | 1 | 12 | 2014-05-08 | KOELN | 50667 | 6.957068 | 50.93811 |
| 34-217-2173-131875 | 34 | 2013-02-23 | 0 | NA | 0 | K7-113-1132-12479 | 22-2-22-131093 | 2 | 22 | 2013-06-10 | POESSNECK | 07381 | 11.593211 | 50.69355 |
| 33-217-2172-144253 | 33 | 2013-08-10 | 0 | NA | 0 | K5-112-1122-243386 | 12-1-12-243480 | 1 | 12 | 2013-11-22 | BAD STAFFELSTEIN | 96231 | 10.999853 | 50.10147 |
write.csv(Final_dataset_group_20,"Final_dataset_group_20.csv", row.names = FALSE)In this section we will shortly display the different tabs of our shiny app that was built using the final dataset presented in the previous section.
Here we briefly introduce the problem to enable the users of the app to understand the purpose for which the app was built.
We also included community specific popups into the app that display the following information:
In this tab, customers are able to view the quantities of parts supplied to the different plants of vehicle manufacturers OEM1 and OEM2. These include: 11-OEM:1, 12-OEM:1, 21-OEM:2 and 22-OEM:2.The production period can be broken down by the user into years, months, and weeks of the period in which non-compliant steel was used.
In this section, users are able to compare the locality, spread and skewness of the variable mileage,i.e, kilometers traveled before the part was found defective, for any two selectable communities.
Based on the vehicle ID, users can find out whether their car contained a part that was found defective or not, or even if their car was affected by this incident, i.e., if their car was contains a part that was produced with non-compliant steel or not.
Finally, in this section, the users will find the dataset used to build the shiny app. In this way, we allow our customers to have full transparency over the process.
In this section we will shortly discuss some of the insights gained through the development of our web application.
Throughout the heatmap of the damaged car’s hotspot in the second tab of our application, we were able to identify three specific federal states which were significantly affected by the use of non-compliant steel in our products:
North Rhine Westphalia: five cities in this state are among the top ten cities with the most cases of proven damaged cars. These include: Cologne, Dortmund, Bochum, Bonn, and Bielefeld. This insight leads us to the conclusion that this state is the primary hotspot for affected cars across the country.
Saxony: the state of Saxony takes second place as an affected car hotspot. Including two cities in the top ten, namely Dresden and Leipzig.
Bavaria: The southern state takes the third place, including cities like Munster and Augsburg.
The bar plot in our second tab of the web application delivered part quantities, allowed us to identify that the plants of the car manufacturer OEM1, namely, plants 11 and 12 received significantly more parts than the plants of the car manufacturer OEM2, plants 21 and 22, over the entire period of production. In particular, plant 12 received the major quantities of parts, while plant 22 was supplied with the lowest quantity of parts during this period.
This leads us to the assumption that the car manufacturer OEM1 should have been delivered with a major quantity of defective parts than its counterpart OEM2. We will test this assumption at the car manufacturer level as well as plant level to see if the data matches the information given by the bar plot.
Final_dataset_Group_20 <-read_csv("Final_dataset_Group_20.csv")
#Calculate the total number of defective parts delivered throughout the entire period. (will be used laster)
Tot_Num_Defective_Cases <- nrow(Final_dataset_Group_20 %>%
filter(Defective_Part == 1))
Manufacturer_Level<- Final_dataset_Group_20 %>%
filter(Defective_Part == 1)%>% #Filter for defective parts
group_by(OEM) %>% #Group by car manufacturer
count(OEM, sort =T)%>% # Count the number of cases by group
rename('Number of defective part cases' = n) %>% # Rename n variable
mutate('Share of defective parts delivered to the OEM' =
paste(round(((`Number of defective part cases`/Tot_Num_Defective_Cases)*100),2),"%", sep = "")) #Create a new variable which computes the share of defective parts by OEM
Plant_Level<- Final_dataset_Group_20 %>%
filter(Defective_Part == 1)%>% #Filter for defective parts
group_by(Factory_Number) %>% #Group by plant
count(Factory_Number, sort =T)%>% # Count the number of cases by group
rename('Number of defective part cases' = n) %>% # Rename n variable
rename('Plant Number' = Factory_Number) %>% # Rename Factory_Number variable
mutate('Share of defective parts delivered to the plant' =
paste(round(((`Number of defective part cases`/Tot_Num_Defective_Cases)*100),2),"%", sep = "")) #Create a new variable which computes the share of defective parts by OEM
rm(Tot_Num_Defective_Cases) #Not necessary anymore
kable(Manufacturer_Level)%>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "center") %>%
column_spec(1:ncol(Manufacturer_Level), width_max = "12em", width_min = "8em")| OEM | Number of defective part cases | Share of defective parts delivered to the OEM |
|---|---|---|
| 1 | 78948 | 78.42% |
| 2 | 21719 | 21.58% |
kable(Plant_Level)%>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "center") %>%
column_spec(1:ncol(Plant_Level), width_max = "12em", width_min = "8em")| Plant Number | Number of defective part cases | Share of defective parts delivered to the plant |
|---|---|---|
| 12 | 40088 | 39.82% |
| 11 | 38860 | 38.6% |
| 21 | 15630 | 15.53% |
| 22 | 6089 | 6.05% |
The data on the tables created above confirm our expectations. Now, we are in a position where we can confirm that OEM1 was not only supplied with a major quantity of parts during this period but also possessed a major number of defective parts than its counterpart OEM2. At the plant level, plant 12 had the largest share of defective parts, followed by plant 11. Consequently, plants 21 and 22 received a smaller share of defective parts, which aligns with the insights taken from the bar plot in our web application.
This information would suggest that car types of the manufacturer OEM1 would be more prone to getting rust damages from the use of non-compliant steel than cars of the manufacturer OEM2.
The vehicle’s useful life, considered in the analysis, corresponds to
the number of kilometres travelled before the defect was found.
Moreover, their visualization and the comparisons, based on the
registration community, are performed through the usage of boxplots.
Indeed, boxplots display the distribution of a set of data through five
key points. The minimum and maximum points are represented by the
horizontal lines, while the colored box represents the interquartile
range, and is limited by the first and third quartile, i.e. the values
under which the 25% and 75% of points are found. The bold line in the
middle of the box is the median, or second quartile. Finally, outliers
are shown as points.
This type of plot is useful for comparing the shape of several
distributions and assessing their symmetry, and allows to identify in a
reliable and precise way the presence of outliers.
However, the set of data of the case study contains also negative values for the mileage of the defective vehicles. This is clearly incorrect, therefore, the boxplots are not reliable and cannot be exploited for further analysis.